exp: general-agent #2525

Merged: mikasenghaas merged 35 commits into main from exp/general-agent on May 25, 2026

@mikasenghaas (Member) commented May 17, 2026

Summary

  • Add general-agent to pyproject.toml (envs list, workspace members, uv sources) and pull pytest-asyncio into the dev group so the env's tests are runnable.
  • Add public configs/general_agent/ with three RLM configs using general-agent-solver-rlm, all logging to the general-agent-debug wandb project:
    • rl_qwen3_0p6b.toml — single-GPU smoke test
    • rl_qwen3_4b.toml — 4 train + 4 infer GPUs, max_steps=200
    • rl_qwen3_30b_a3b.toml — multi-node (1 train + 1 infer, dp=2 / tp=4), max_steps=400
  • Bump deps/research-environments to c752781 (this was origin/main at the time of writing; main has since advanced, so re-bump before merge if a refresh is wanted). Env version bumps:
    • ddbc 0.1.1 → 0.1.2
    • ddbc_rlm 0.1.5 → 0.1.6
    • deepdive 0.2.7 → 0.2.9
    • deepdive_rlm 0.2.11 → 0.2.13
    • general_agent 0.1.0 → 0.1.4
    • opencode_deepdive 0.1.15 → 0.1.16
    • rlm_deepdive 0.2.3 → 0.2.4
    • rlm_swe 0.3.4 → 0.4.2
  • Bump configs/private submodule pointer (now a merge commit 70c3503 that joins the PR's behavior-learning RESULTS writeups with main's rlm5 X-Session-ID header cleanup).
  • Skip vf-eval-style TOMLs (detected via top-level eval list) in tests/unit/test_configs.py::test_load_configs so non-entrypoint configs don't fail validation.
  • Document exp/ as the branch prefix for experiment branches in AGENTS.md.

Verification

  • uv sync --all-extras rebuilds general-agent==0.1.4; entry point general-agent-solver-rlm resolves and vf.load_environment("general-agent-solver-rlm") returns a ComposableEnv.
  • uv run pytest tests/unit/test_configs.py — 106 passed (covers all three new configs/general_agent/*.toml).

Note

Low Risk
Mostly new TOMLs, dependency wiring, and a targeted config-test skip; no changes to core training or auth paths in this diff.

Overview
Wires the general-agent research environment into the repo and adds RL experiment configs for general-agent-solver-rlm on Qwen3 at 0.6B (smoke), 4B, and 30B-A3B scales, all targeting the general-agent-debug W&B project.

Packaging: general-agent is added to the envs extra, uv workspace members/sources, and uv.lock (new editable general-agent==0.1.4; lock also bumps deepdive / opencode-deepdive versions shown in the diff). Dev deps gain pytest-asyncio for async env tests.

Configs: New configs/general_agent/rl_qwen3_{0p6b,4b,30b_a3b}.toml tune steps, seq length, GPU/deployment layout, orchestrator batch/rollouts, and inference parallelism for each model size.

Tests / docs: tests/unit/test_configs.py skips TOMLs with a top-level eval list (vf-eval, not prime-rl entrypoints). AGENTS.md documents exp/ branch prefix for experiment work.

Reviewed by Cursor Bugbot for commit 278ed64.

@mikasenghaas mikasenghaas changed the title from "exp: general-agent" to "exp: general-agent + behavior-learning" May 18, 2026
mikasenghaas and others added 17 commits May 18, 2026 07:27
…ation configs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l in pre-run

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Source ~/.env directly in the shell before uv run rl instead; env vars
propagate to sbatch via --export=ALL.

Reverts 3703dc0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up #1395, which fixes the ParsedToolCall subscripting bug in
renderer_client.from_native_response; it previously raised
'ParsedToolCall' object is not subscriptable on every rollout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4B-Instruct hallucinated tool names and gave up after a few errors
(reward ~0.4% at step 2). Try the thinking variant which is better
at structured tool-use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cting reward

Set behavior_judge_model + behavior_reward_alpha=0.0 so the judge runs
and behavior_<key> metrics get logged, but final_reward stays equal to
task_reward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fecting reward

Same change as baseline: enable judge for metrics but alpha=0.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ics in prompt run

- prompt run: enable judge with alpha=0.0 so behavior_<key> metrics
  get logged but final_reward stays equal to task_reward (same setup
  as baseline)
- all four configs: max_steps 1000 → 200 to keep ablations bounded

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…m-id fix

Picks up PrimeIntellect-ai/research-environments@769298b1 which forwards
PRIME_TEAM_ID as X-Prime-Team-ID on behavior judge requests, so the judge
bills the team balance instead of the user's personal balance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up eaaabf3c which makes final_reward use state.get() so judge
failures don't zero out task_reward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
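The two reward-related changes in the commits above (judge enabled with behavior_reward_alpha=0.0, and final_reward reading state via state.get()) can be illustrated with a small sketch. The real formula lives in research-environments and is not shown in this PR, so the additive blend below is an assumption chosen only because it reproduces the stated behavior: with alpha=0.0 the final reward equals the task reward, and a missing judge result does not zero it out. The state keys and the function signature are hypothetical.

```python
def final_reward(state: dict, alpha: float) -> float:
    """Blend task and behavior rewards (assumed additive form).

    Assumption: the judge writes "behavior_reward" into the rollout
    state on success and leaves the key absent on failure.
    """
    task = state.get("task_reward", 0.0)
    # state.get() with a default means a failed judge call contributes
    # nothing instead of raising or wiping out the task reward.
    behavior = state.get("behavior_reward", 0.0)
    return task + alpha * behavior


# Judge ran, alpha=0.0: behavior is available for logging but ignored in the reward.
metrics_only = final_reward({"task_reward": 1.0, "behavior_reward": 0.5}, alpha=0.0)

# Judge failed (key missing): task reward is preserved.
judge_failed = final_reward({"task_reward": 1.0}, alpha=0.0)

# Non-zero alpha: behavior reward contributes.
blended = final_reward({"task_reward": 1.0, "behavior_reward": 0.5}, alpha=0.5)
```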
Previous runs (cp=1, 32K) saw 6-9% truncation rate and output_tokens
hitting the 32K cap on long trajectories. Double the seq_len and
max_model_len; cp=2 keeps per-rank activation memory flat under the
2x context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
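The arithmetic behind "cp=2 keeps per-rank activation memory flat" can be checked with a short sketch. It assumes that 32K means 32 * 1024 tokens, that context parallelism shards each sequence evenly across cp ranks, and that activation memory scales roughly with the number of tokens a rank holds. The helper `tokens_per_rank` is hypothetical and not part of prime-rl.

```python
def tokens_per_rank(seq_len: int, cp: int) -> int:
    """Tokens each context-parallel rank holds, assuming an even split."""
    if cp < 1:
        raise ValueError("cp must be at least 1")
    if seq_len % cp != 0:
        raise ValueError("seq_len must be divisible by cp")
    return seq_len // cp


# Previous setup: 32K context on a single context-parallel rank.
before = tokens_per_rank(32 * 1024, cp=1)

# New setup: doubled context, split across two context-parallel ranks.
after = tokens_per_rank(64 * 1024, cp=2)
```

Under these assumptions both setups put the same number of tokens on each rank, which is why doubling seq_len together with cp leaves the per-rank activation footprint roughly unchanged.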
Picks up fdca6d76 which logs behavior_reward as the raw judge mean
(independent of task_reward) and moves the solution gate into
final_reward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 4 phase-2 runs completed step 200 with promising trajectories.
Extend max_steps to 400 to continue training from the step_200
checkpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bmodule

The general-agent + behavior-learning configs are not meant for the public
prime-rl repo. Move them into the research-configs submodule mounted at
configs/private/ so they share access controls with the rest of our
internal experiment configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas changed the title from "exp: general-agent + behavior-learning" to "exp: general-agent" May 22, 2026
mikasenghaas and others added 6 commits May 22, 2026 10:08
…g RESULTS.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e eval-config helper

- Bump deps/research-environments to origin/main HEAD (general-agent 0.1.4)
- Add configs/general_agent/{rl_qwen3_0p6b_debug,rl_qwen3_4b,rl_qwen3_30b_a3b}.toml
  using the general-agent-solver-rlm env
- Rename is_vf_eval_config -> is_eval_config in tests/unit/test_configs.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…agent-debug wandb project

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ULTS

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t step counts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas requested a review from samsja May 22, 2026 04:50
@mikasenghaas mikasenghaas marked this pull request as ready for review May 22, 2026 04:51
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	configs/private
@mikasenghaas mikasenghaas merged commit 0057f3b into main May 25, 2026
18 checks passed